Appendix C — Assignment C

Diagnostics & Remedial Measures

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Make R code chunks to insert code and type your answer outside the code chunks. Ensure that the solution is written neatly enough to understand and grade.

  3. Render the file as HTML to submit. For theoretical questions, you can either type the answer and include the solutions in this file, or write the solution on paper, scan and submit separately.

  4. The assignment is worth 100 points, and is due on 22nd October 2023 at 11:59 pm.

  5. There is an extra credit question worth 10 points in the end. You can score 110/100 in the assignment.

  6. Five points are properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (the theory part may be scanned and submitted separately) (2 pts).
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.). There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
  • Final answers of each question are written clearly (1 pt).
  • The proofs are legible, and clearly written with reasoning provided for every step. They are easy to follow and understand (1 pt)

C.1 Real estate sales

Read the file real_estate_sales.txt.

C.1.1

Develop a linear regression model to predict price of a house based on square_feet (floor area of the house). Check the following model assumptions using diagnostic plots:

  1. Linear relationship

  2. Homoscedasticity

  3. Normal distribution of errors

For each of the above assumption, comment if it is appears to be satisfied or violated based on the plots.

(2 + 2 + 2 = 6 points)

C.1.2

Consider performing the statistical test to check the linear relationship assumption. What condition must be satisfied by the data for the test to be performed? Show that the condition is satisfied by this data.

(1 + 2 = 3 points)

C.1.3

Perform the statistical test to check the linear relationship assumption. What is your conclusion?

(2 points)

C.1.4

Check assumptions (2) and (3) as mentioned in C.1.1 with statistical tests, and mention your conclusion.

(2 + 2 = 4 points)

C.1.5

Use the Box-Cox procedure to identify the appropriate transformation of the regression model developed in C.1.1. Write the transformed model equation.

(3 + 2 = 5 points)

C.1.6

Check all the 3 model assumptions mentioned in C.1.1 for the transformed model developed in C.1.5. Use both diagnostic plots and statistical tests. Mention your comments based on the plots and conclusions based on the tests.

(2 + 2 + 2 = 6 points)

C.1.7

Is the transformed model (developed in C.1.5) better than the original model (developed in C.1.1) with respect to the model assumptions? Comment based on the tests/plots in the previous question (C.1.6).

(2 points)

C.1.8

If the linearity assumption is still not satisfied in the transformed model (developed in C.1.5),

  1. Propose another transformation based on the diagnostic plot(s) plotted in the previous question.

  2. Write the transformed model equation.

  3. Show that the transformed model (based on the proposed transformation in (a)) satisfies the linearity assumption based on the statistical test, if the probability of type I error considered is 1%.

  4. Also make the diagnostic plot for the transformed model (based on the proposed transformation in (a)) to show that it satisfies the linearity assumption.

(2 + 2 + 2 + 2 = 8 points)

C.1.9

Does the transformed model developed in the previous question (C.1.8) satisfy the homoscedasticity and normally distributed errors assumptions? Verify based on diagnostic plots and statistical tests.

(2 + 2 = 4 points)

C.1.10

Plot all the three models - the original model (developed in C.1.1), the boxcox transformed model (developed in C.1.5), and the final model (developed in C.1.8) over a scatterplot of price vs square_feet. Which model seems to have the best fit? Also report the \(R^2\) of all the 3 models.

(1 + 3 + 1 + 3 = 8 points)

C.2 Mortality vs Income

The dataset infmort.csv gives the infant mortality of different countries in the world. The column mortality contains the infant mortality in deaths per 1000 births.

C.2.1

Read the dataset. There are 4 observations that have missing values. Remove those observations from the dataset.

Hint: You may use the R function complete.cases().

(2 points)

C.2.2

Over the scatterplot of mortality against income, plot the regression model predicting mortality based on income. Report the \(R^2\) for this model.

(2 + 1 = 3 points)

C.2.3

Are there outlying observations in the data with respect to the model developed in the previous question (C.2.2)? How many? Consider observations having a magnitude of standardized residual more than 5 as outliers.

(2 points)

C.2.4

Based on the plot in C.2.2, propose a transformation for income. Justify the proposed transformation. Write the equation of the transformed model and report its \(R^2\).

(2 + 2 + 1 + 1 = 6 points)

C.2.5

Plot the transformed model developed in the previous question over a scatterplot of mortality vs the transformed income (as transformed in the previous question (C.2.4)). Did the model fit of the transformed model (developed in C.2.4) improve over the fit of the original model (developed in C.2.2)?

(2 + 1 = 3 points)

C.2.6

Use Box-Cox to identify the appropriate transformation for mortality in the transformed model (developed in C.2.4). Write the equation of the Box-Cox transformed model, and report its \(R^2\).

(3 points)

C.2.7

Plot the Box-Cox transformed model developed in the previous question (C.2.6) over a scatterplot of the transformed mortality vs the transformed income. Did the fit of the Box-Cox transformed model improve over the fit of the transformed model (developed in C.2.4)?

(2 + 1 = 3 points)

C.2.8

Plot the Box-Cox transformed model (developed in C.2.6) over the scatterplot of mortality vs income.

(3 points)

C.2.9

Are there outlying observations in the data with respect to the Box-Cox transformed model (developed in C.2.6)? If not, then how did they disappear given that there were outliers in the original model (developed in C.2.2)?

(1 + 3 = 4 points)

C.3

Suppose the error terms in the following linear regression model are independent \(N(0, \sigma^2)\):

\(Y_i = \beta_0 + \beta_1X_i + \epsilon_i, \epsilon_i \sim N(0, \sigma^2)\).

C.3.1

If \(X_i\) is transformed to \(X_i'=1/X_i\), then will the error terms still be independent \(N(0, \sigma^2)\)? If not, then what will be the change in their distribution?

(4 points)

C.3.2

Instead of \(X_i\), if \(Y_i\) is transformed to \(Y_i'=1/Y_i\), then will the error terms still be independent \(N(0, \sigma^2)\)? If not, then what will be the change in their distribution?

(4 points)

C.4

A simple linear regression model with intercept \(\beta_0 = 0\) is under consideration. Data have been obtained that contain replications.

State the full and reduced models for testing the appropriateness of the regression function under consideration.

What are the degrees of freedom associated with the full and reduced model if number of observations \(n=20\) and number of distinct values of the predictor \(c=10\)?

(2 + 2 = 4 points)

C.5

Let the observed value of the response variable for the \(i^{th}\) replicate of the \(j^{th}\) level of the predictor \(X\) be \(Y_{ij}\), where \(i=1,...,n_j\), \(j= 1,...,c\).

The fitted regression model is:

\(\hat{Y}_{ij} = \hat{\beta}_0+\hat{\beta}_1X_j\)

The error deviation can be decomposed as the pure error deviation, and the lack of fit deviation as follows:

\(Y_{ij}-\hat{Y}_{ij} = (Y_{ij}-\bar{Y}_j) + (\bar{Y}_j-\hat{Y}_{ij})\)

Given the above equation, show that:

\(\sum_{i=1}^{n_j}\sum_{j=1}^c(Y_{ij}-\hat{Y}_{ij})^2 = \sum_{i=1}^{n_j}\sum_{j=1}^c(Y_{ij}-\bar{Y}_j)^2 + \sum_{i=1}^{n_j}\sum_{j=1}^c(\bar{Y}_j-\hat{Y}_{ij})^2\).

(6 points)

C.6 Bonus question

(This is extra credit question, you are not required to do it)

For \(\lambda \ne 0\), the Box-Cox transformation is given as:

\(y_i^{(\lambda)}=\frac{y_i^\lambda-1}{\lambda}...(1)\)

However, typically practitioners transform the response as:

\(y_i^{(\lambda)}=y_i^\lambda...(2)\).

Why is (2) acceptable in general even when (1) is the transformation proposed by George Box and David Cox? In which cases will it be detrimental to use (2) instead of (1), and one must use (1)?

Support your arguments with examples based on simulated data to answer this question. You must provide an example when both (1) and (2) are acceptable, and another example when (2) is not effective, and only (1) must be used.

(10 points)